Welcome to another tutorial for this class, COMP/STAT 112: Introduction to Data Science! It will be similar to the others, including demo videos and files embedded in this document and practice problems with hints or solutions at the end. There are some new libraries, so be sure to install those first.
As most of our files do, we start this one with three R code chunks: 1. options, 2. libraries and settings, 3. data.
knitr::opts_chunk$set(echo = TRUE,
message = FALSE,
warning = FALSE)
library(tidyverse) # for data cleaning and plotting
library(googlesheets4) # for reading googlesheet data
library(lubridate) # for date manipulation
library(openintro) # for the abbr2state() function
library(palmerpenguins)# for Palmer penguin data
library(maps) # for map data
library(ggmap) # for mapping points on maps
library(gplots) # for col2hex() function
library(RColorBrewer) # for color palettes
library(sf) # for working with spatial data
library(leaflet) # for highly customizable mapping
library(ggthemes) # for more themes (including theme_map())
library(plotly) # for the ggplotly() - basic interactivity
library(gganimate) # for adding animation layers to ggplots
library(gifski) # for creating the gif (don't need to load this library every time, but need it installed)
library(transformr) # for "tweening" (gganimate)
library(shiny) # for creating interactive apps
library(patchwork) # for nicely combining ggplot2 graphs
library(gt) # for creating nice tables
library(rvest) # for scraping data
library(robotstxt) # for checking if you can scrape data
gs4_deauth() # To not have to authorize each time you knit.
theme_set(theme_minimal())
# Lisa's garden data
garden_harvest <- read_sheet("https://docs.google.com/spreadsheets/d/1DekSazCzKqPS2jnGhKue7tLxRU3GVL1oxi-4bEM5IWw/edit?usp=sharing") %>%
mutate(date = ymd(date))
After this tutorial, you should be able to do the following:
Import data into R that is stored in a common file type (.csv, .txt, Excel, etc.) or in a Google spreadsheet.
Find resources to read in data that is in a format other than one of the more common formats.
Use rvest functions to scrape data from a simple webpage and recognize when scraping the data will require more advanced tools.
Create nice tables with the gt functions.
Use patchwork to display related plots together nicely.
In this section, we’ll learn some of the common ways to import data into R. You have already used many of these functions, and others you may never need. So, this will be a pretty quick overview.
The table below lists some common import functions and when you would use them.
| Function | Use when |
|---|---|
| read_csv() | data are saved in .csv (comma-delimited) format - you can save Excel files and Google Sheets in this format |
| read_delim() | data are saved in other delimited formats (tab, pipe, etc.) |
| read_sheet() | data are in a Google Sheet |
| st_read() | reading in a shapefile |
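For instance, reading the same kind of data from a few different formats might look like the sketch below. The file names here are hypothetical, just to illustrate the function calls (the Google Sheet URL is the garden data one used above):

```r
library(tidyverse)     # read_csv(), read_delim()
library(googlesheets4) # read_sheet()
library(sf)            # st_read()

# Comma-delimited file
harvest_csv <- read_csv("garden_harvest.csv")

# Pipe-delimited file: name the delimiter explicitly
harvest_pipe <- read_delim("garden_harvest.txt", delim = "|")

# Google Sheet, identified by its sharing URL
harvest_sheet <- read_sheet("https://docs.google.com/spreadsheets/d/1DekSazCzKqPS2jnGhKue7tLxRU3GVL1oxi-4bEM5IWw/edit?usp=sharing")

# Shapefile (spatial data)
boundaries <- st_read("boundaries.shp")
```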
After reading in new data, it is ALWAYS a good idea to do some quick checks of the data. Here are some things I always do:
Open the data in the spreadsheet-like viewer and take a look at it. Sort it by different variables by clicking on the arrows next to the variable name. Make sure there isn’t anything unexpected.
Do a quick summary of the data. The code below is one of the things I almost always do because it’s quick. For quantitative variables, it tells me some summary statistics and will let me know if there are missing values. For factors (they need to be factors, not just character variables - the mutate() changes them to factors), it shows you counts for the top categories and tells you if there are any missing values.
garden_harvest %>%
mutate(across(where(is.character), as.factor)) %>%
summary()
## vegetable variety date
## tomatoes :232 grape : 37 Min. :2020-06-06
## lettuce : 68 Romanesco : 34 1st Qu.:2020-07-21
## beans : 38 pickling : 32 Median :2020-08-09
## zucchini : 34 Lettuce Mixture : 28 Mean :2020-08-08
## cucumbers: 32 Farmer's Market Blend: 27 3rd Qu.:2020-08-26
## peas : 27 Bonny Best : 26 Max. :2020-10-03
## (Other) :254 (Other) :501
## weight units
## Min. : 2.0 grams:685
## 1st Qu.: 86.0
## Median : 252.0
## Mean : 501.8
## 3rd Qu.: 599.0
## Max. :7350.0
##
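Beyond summary(), one other quick check I find useful (an optional extra, not required for the rest of the tutorial) is counting the missing values in each column explicitly:

```r
# Count NAs per column; all zeros means no missing values anywhere
garden_harvest %>%
  summarize(across(everything(), ~ sum(is.na(.x))))
```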
When reading in data from a file I created, I will often use the Import Wizard to help me write the code. DO NOT use it to import the data directly, as you will need the code that reads in the data in order to knit your document. Watch the quick video below to see how I use it.
readr documentation and data import cheatsheet
Read in the fake garden harvest data. Find the data here and click on the Raw button to get a direct link to the data.
Read in this data from the kaggle website. You will need to download the data first. Do some quick checks of the data to assure it has been read in appropriately.
While a great deal of data is available via Web APIs and data warehouses, not all of it is. Programs can use a process called web scraping to collect data that is available to humans (via web browsers) but not computer programs.
Use the paths_allowed() function from the robotstxt library to check whether you can scrape data from a webpage before you begin. I check four pages below. The results tell me that I cannot scrape the second webpage but can scrape the other ones.
paths_allowed(paths = "https://www.macalester.edu/registrar/schedules/2017fall/class-schedule/#crs10008")
## [1] TRUE
paths_allowed(paths = "https://www.zillow.com/homes/55104_rb/")
## [1] FALSE
paths_allowed(paths = "https://www.billboard.com/charts/hot-100")
## [1] TRUE
paths_allowed("https://salsacycles.com/bikes")
## [1] TRUE
In order to gather information from a webpage, we must learn the language used to identify patterns of specific information. For example, on the Macalester Registrar’s Fall 2017 Class Schedule you can visually see that the data is represented in a table. The first column shows the course number, the second the title, etc.
We will identify data in a webpage using a pattern matching language called CSS Selectors that can refer to specific patterns in HTML, the language used to write web pages. For example, the CSS selector “a” selects all hyperlinks in a webpage (“a” represents “anchor” links in HTML), “table > tr > td:nth-child(2)” would find the second column of an HTML table.
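To get a feel for how selectors behave before touching a real page, here is a small sketch using rvest's minimal_html() helper. The HTML fragment is invented for illustration; the selector pattern mirrors the nth-child example above:

```r
library(rvest)

# A made-up two-row table, loosely modeled on a class schedule
page <- minimal_html('
  <table>
    <tr><td>AMST 101-01</td><td>Explorations of Race and Racism</td></tr>
    <tr><td>AMST 103-01</td><td>Critical Methods</td></tr>
  </table>')

# Select the second cell of each table row
page %>%
  html_nodes("td:nth-child(2)") %>%
  html_text()
# should return the two course title strings
```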
I will illustrate how to find these attributes using some tools that are available in the Chrome web browser. You should install the Selector Gadget for Chrome (the video on that same page can be useful). With this, you “teach” the Selector Gadget which data you are interested in on a web page, and it will show you the CSS Selector for this data. We will also use developer tools to find the selectors.
Head over to the Macalester Registrar’s fall 2017 class schedule. Click the selector gadget icon in the top right corner of Chrome. As you mouse over the webpage, different parts will be highlighted in orange. Click on the first course number, AMST 101-01. You’ll notice that the Selector Gadget information in the lower right describes what you clicked on:
Scroll through the page to verify that only the information you intend (the course number) is selected. The selector panel shows the CSS selector (.class-schedule-course-number) and the number of matches for that CSS selector (762).
We can also do this using the Developer Tools. On the webpage, right click and choose inspect. On the Elements tab, click the select an element icon in the upper left-hand corner. Then, go click on AMST 101-01 on the webpage. You should see something like the image below.
Now that we have the selector for the course number, let’s find the selector for the days of the week. Clear the selector by clicking the “Clear” button on the result pane, and then click the W under days for AMST 101-01. You will notice that the selector was too broad and highlighted information we don’t want. You need to teach Selector Gadget a correct selector by clicking the information you don’t want to turn it red. Once this is done, you should have 762 matches and a CSS selector of .class-schedule-course-title+ .class-schedule-label.
When I use the Developer Tools and highlight a class day, I see the following, which seems to indicate that the selector is td.class-schedule-label. Notice that other fields (like Instructor) show this same selector.
rvest and CSS Selectors

Now that we have identified CSS selectors for the information we need, let’s fetch the data in R. We will be using the rvest package, which retrieves information from a webpage and turns it into R data tables.
First, we read in the webpage.
fall2017 <- read_html("https://www.macalester.edu/registrar/schedules/2017fall/class-schedule/#crs10008")
Once the webpage is loaded, we can retrieve data using the CSS selectors we specified earlier. The following code retrieves the course numbers and names as a vector and puts them in a dataset (tibble) called course_df. The html_nodes() function allows us to identify nodes in a variety of ways. See the “Finding elements with CSS selectors” section of this tutorial for more information.
# Retrieve and inspect course numbers
course_nums <-
fall2017 %>%
html_nodes(".class-schedule-course-number") %>%
html_text()
head(course_nums)
## [1] "AMST 101-01" "AMST 103-01" "AMST 200-01" "AMST 203-01" "AMST 219-01"
## [6] "AMST 229-01"
# Retrieve and inspect course names
course_names <-
fall2017 %>%
html_nodes(".class-schedule-course-title") %>%
html_text()
head(course_names)
## [1] "Explorations of Race and Racism"
## [2] "The Problems of Race in US Social Thought and Policy"
## [3] "Critical Methods for American Studies Research"
## [4] "Politics and Inequality: American Welfare State"
## [5] "In Motion: African Americans in the United States"
## [6] "Narrating Black Women's Resistance"
course_df <- tibble(number=course_nums, name=course_names)
head(course_df)
Next, let’s try to grab the day of the week. First, we’ll do it using the selector we found with the Selector Gadget.
course_days <- fall2017 %>%
html_nodes(".class-schedule-course-title+ .class-schedule-label") %>%
html_text()
head(course_days)
## [1] "Days: W" "Days: MWF" "Days: MWF" "Days: MWF" "Days: MWF" "Days: MWF"
This looks pretty good, although we would like to get rid of the “Days:” at the beginning. We’ll come back to that in a minute.
Let’s see what happens when we try to use the selector we found using developer tools.
fall2017 %>%
html_nodes("td.class-schedule-label") %>%
html_text() %>%
head()
## [1] "Days: W" "Time: 07:00 pm-10:00 pm"
## [3] "Room: ARTCOM 102" "Instructor: Gutierrez, Harris"
## [5] "Avail./Max.: Closed -5 / 25" "Days: MWF"
This returns much more than what we want. That is because there are five fields that use that selector. So, we need to be more specific. One way we can do this is by identifying which “child” it is. If we look at the children of the “parent” node, we see that “Days” is the 3rd child. If you are used to HTML, you likely could have figured that out without doing this step.
fall2017 %>%
html_node("table tr") %>% #just look at the first one
html_children() %>%
html_text()
## [1] "Number / Section" "Name" "Days" "Time"
## [5] "Room" "Instructor" "Avail. / Max."
We can use that information to make a more specific selector.
fall2017 %>%
html_nodes("td.class-schedule-label:nth-child(3)") %>%
html_text() %>%
head()
## [1] "Days: W" "Days: MWF" "Days: MWF" "Days: MWF" "Days: MWF" "Days: MWF"
We would also like to get rid of the “Days:” at the beginning. We can do that with str_sub() from the stringr library.
course_days <- fall2017 %>%
html_nodes("td.class-schedule-label:nth-child(3)") %>%
html_text() %>%
str_sub(start = 7)
head(course_days)
## [1] "W" "MWF" "MWF" "MWF" "MWF" "MWF"
This example aims to show a couple of techniques not covered in the previous example. We will be examining Salsa brand bikes.
Initially, we just read in the webpage, like we did before.
salsa_url <- "https://salsacycles.com/bikes"
salsa <- read_html(salsa_url)
Below I pull the name of the bike. Note that this returns the same results whether or not I comment out the 2nd line of code.
salsa %>%
html_nodes("div.small-6.large-3.columns.left.bike-listing") %>%
html_nodes(".title") %>%
html_text()
## [1] "WARBIRD" "VAYA" "MARRAKESH" "FARGO"
## [5] "CUTTHROAT" "JOURNEYMAN 700c" "JOURNEYMAN 650b" "WARROAD"
## [9] "STORMCHASER" "CASSIDY" "BLACKTHORN" "HORSETHIEF"
## [13] "SPEARFISH" "RUSTLER" "TIMBERJACK" "TIMBERJACK KIDS"
## [17] "RANGEFINDER" "BEARGREASE" "MUKLUK" "BLACKBOROW"
Below I pull the classification of the bike. Try commenting out the second line of code. What happens? Try removing the str_trim(). What happens?
salsa %>%
html_nodes(".small-6.large-3.columns.left.bike-listing") %>%
html_nodes(".classification") %>%
html_text() %>%
str_trim()
## [1] "Gravel Racing"
## [2] "Gravel/Light Touring"
## [3] "World Touring"
## [4] "Off-Road Touring/Bikepacking"
## [5] "Ultra Endurance Bikepacking/Gravel"
## [6] "ALL-ROAD"
## [7] "ALL-ROAD"
## [8] "Endurance Road"
## [9] "Single Speed Gravel"
## [10] "29\" Full-Suspension Enduro"
## [11] "29\" Full-Suspension All-Mountain"
## [12] "29″ Full Suspension Trail"
## [13] "29″/27.5+ Full Suspension XC"
## [14] "27.5″ Full Suspension Trail"
## [15] "Hardtail Trail"
## [16] "Kids Hardtail Adventure"
## [17] "Approachable Hardtail Trail"
## [18] "Groomed Racing"
## [19] "Maximum Floatation"
## [20] "OFF-ROAD EXPEDITION / WORLD TOURING / CREATIVE THINKING"
What I would really like is some more detailed information for each of these bikes, like their prices, sizes, etc. But that information is on each bike's webpage, e.g., WARBIRD.
I can collect the piece of the url that will link to each bike's page…
bike_pages <-
salsa %>%
html_nodes(".small-6.large-3.columns.left.bike-listing a") %>%
html_attr("href")
bike_pages
## [1] "/bikes/warbird" "/bikes/vaya" "/bikes/marrakesh"
## [4] "/bikes/fargo" "/bikes/cutthroat" "/bikes/journeyman"
## [7] "/bikes/journeyman_650b" "/bikes/warroad" "/bikes/stormchaser"
## [10] "/bikes/cassidy" "/bikes/blackthorn" "/bikes/horsethief"
## [13] "/bikes/spearfish" "/bikes/rustler" "/bikes/timberjack"
## [16] "/bikes/timberjack_kids" "/bikes/rangefinder" "/bikes/beargrease"
## [19] "/bikes/mukluk" "/bikes/blackborow"
And combine that with the main Salsa page to form the url for each bike.
url <- paste("https://salsacycles.com", bike_pages, sep = "")
url
## [1] "https://salsacycles.com/bikes/warbird"
## [2] "https://salsacycles.com/bikes/vaya"
## [3] "https://salsacycles.com/bikes/marrakesh"
## [4] "https://salsacycles.com/bikes/fargo"
## [5] "https://salsacycles.com/bikes/cutthroat"
## [6] "https://salsacycles.com/bikes/journeyman"
## [7] "https://salsacycles.com/bikes/journeyman_650b"
## [8] "https://salsacycles.com/bikes/warroad"
## [9] "https://salsacycles.com/bikes/stormchaser"
## [10] "https://salsacycles.com/bikes/cassidy"
## [11] "https://salsacycles.com/bikes/blackthorn"
## [12] "https://salsacycles.com/bikes/horsethief"
## [13] "https://salsacycles.com/bikes/spearfish"
## [14] "https://salsacycles.com/bikes/rustler"
## [15] "https://salsacycles.com/bikes/timberjack"
## [16] "https://salsacycles.com/bikes/timberjack_kids"
## [17] "https://salsacycles.com/bikes/rangefinder"
## [18] "https://salsacycles.com/bikes/beargrease"
## [19] "https://salsacycles.com/bikes/mukluk"
## [20] "https://salsacycles.com/bikes/blackborow"
We have not used square bracket notation too much in this class, but it references specific elements of a vector. So, I can read in the first webpage, and grab the bike name. (Note that this is just giving the info for the first bike displayed on that page. I would have to do more work to dig for all of them.)
url[1] %>%
read_html() %>%
html_nodes("h1.bike-title") %>%
html_text()
## [1] "Warbird Carbon GRX 810 Di2"
And I can find the price of that bike. We would have to do some work to convert this to a number.
url[1] %>%
read_html() %>%
html_nodes(".price") %>%
html_text()
## [1] "$5699 USD (MSRP)"
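As a sketch of that conversion, readr's parse_number() drops the non-numeric characters around the first number it finds:

```r
library(readr)

# Strip "$", "USD", and "(MSRP)" to get a numeric price
parse_number("$5699 USD (MSRP)")
## [1] 5699
```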
And the description:
url[1] %>%
read_html() %>%
html_nodes(".platform-copy p") %>%
html_text()
## [1] "Achieving gravel greatness takes hard work and tenacity—and a gravel race bike that can hang with you every pedal stroke of the way. For that, there’s the Warbird Carbon GRX 810 Di2. Our comfortable and efficient Class 5™ VRS-equipped high-modulus carbon fiber frameset and Shimano’s refined electronic Di2 shifting ensure that no effort is wasted on the gravel-strewn road to the finish line."
I could even pull out an entire table of information. This one took some digging to find. I encourage you to run it line by line to better understand what each line of code does.
url[1] %>%
read_html() %>%
html_nodes("table") %>%
.[7] %>%
html_table() %>%
.[[1]]
I would likely want to do this for every bike. I can do this in a nice way and bring it back in a single vector using some functions from the purrr library, including map() and flatten_chr(). This takes a while to run (> 1 minute). I’m sure there is a more efficient way, but I don’t know what it is.
url %>%
purrr::map(
function(x)
read_html(x) %>%
html_nodes(".price") %>%
html_text()
) %>%
flatten_chr()
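One possibly more direct variant (an assumption on my part, not tested against this site) is purrr's map_chr(), which returns a character vector without the flatten_chr() step, provided each page yields exactly one price:

```r
# map_chr() errors if a page returns zero or multiple .price nodes,
# so dplyr::first() guards by keeping only the first match (NA if none)
prices <- purrr::map_chr(
  url,
  ~ read_html(.x) %>%
      html_nodes(".price") %>%
      html_text() %>%
      dplyr::first()
)
```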